Scaling of human behavior during portal browsing 
We investigate transitions of portal users between different subpages. A weighted network of
portal subpages is reconstructed, where edge weights are the numbers of corresponding transitions.
Distributions of link weights and node strengths follow power laws over several decades. Node
strength increases faster than linearly with node degree. The distribution of time spent by a user
at one subpage decays as a power law with an exponent around 1.3. The distribution P(z) of the
number of unique subpages visited during one visit is exponential. We find a square-root dependence
between the average z and the total number of transitions n during a single visit. The individual
path of a portal user resembles a self-attracting walk on the weighted network. An analytical model
is developed that reproduces part of the collected data.



PACS numbers: 89.75.-k, 02.50.-r, 05.50.+q



I. INTRODUCTION 

There are three categories of Internet users. Observers
limit their activity to keeping in touch with family and
friends. Consumers treat the Internet as a source of the
information they need for work and for planning free
time, hobbies, travel, etc. The last category, creators,
are active on newsgroups and forums or write their own
blogs. The distribution of users among these three
categories changes quickly, reflecting the evolution of
society. In Poland 53 percent of men and 46 percent of
women use the Internet, and most of them are observers
and consumers.

Web portals have gained special attention because they
are, for many users, the starting point of their web
browsing. The simple structure, with a homepage
aggregating almost all links, allows inexperienced users
to find basic information. For more experienced users
visiting a web portal is a daily habit: every day they
follow the news there. The set of web pages they choose,
the way they browse and the time they spend reading the
site's content reflect their pastime habits and interests
in the real world. They combine two spaces, real and
virtual, in an equilibrium. The analysis of the statistics
of user behavior can bring deeper insight into human
dynamics on the web. This issue was thoroughly discussed
in earlier articles. Our goal in this paper was to analyze
the behavior of users browsing portals (a portal consists
of subpages, one of which is the home page). Visitor
behavior is described by the way users navigate between
subpages, how much time they spend at each of them, and
whether and how many times they come back to previously
visited subpages. Knowledge about these habits can be
useful for portal designers, as well as for planning a
marketing strategy that should fix an optimal number and
distribution of advertisements.

We organized the present paper as follows. In Sec. II
we explain details of the dataset we use; in Sec. III we
present various statistics of portal browsing. We
construct a weighted network of subpages and analyze
standard properties of this network. In Sec. IV we focus
on the users' browsing strategy. The way of navigating
between portal subpages is modeled as a special
self-attracting walk and the simulation results are given
in Sec. V. We also develop a simple analytic approach
(Sec. VI) that partly fits the numerical data.



II. DATASET 

The analysis was based on cookie statistics provided
by the Gemius company for two Polish portals. A cookie is
a small text file that a web browser automatically
receives from a web server while visiting a web page. It
makes it possible to distinguish users and to maintain
data related to them during navigation. The shortcoming of
this method is that some users can delete their cookies
and get a new one the next time they enter the Internet.
Therefore the number of cookie users does not correspond
to the number of real Internet users. The longer the data
collection period, the larger the difference: it is
negligible for a day but very significant for a month. For
this reason we used only daily data for our analysis. Our
data was collected between 27/07/07 and 28/07/07 at
two portals with large numbers of daily visitors (portal
A had more than a million and portal B around 3.9 million
cookie users during twenty-four hours). The internal
structure of the portal, i.e. its division into subpages,
was taken into account. On the day of investigation
portal A consisted of N = 195 active subpages, where an
active page is a subpage that was visited at least
once during the analyzed day. Portal B consisted of
N = 515 active subpages. The number of active pages can
vary from day to day. The data gathered for one web site
included: user ID, subpage ID, and the time of the user's
arrival at a subpage (more precisely, the time she or he
clicked on this page). It should be emphasized that the
whole collection of times and subpage IDs for a given
user, which we will refer to as a "visit chain", cannot be
identified with the time physically spent by the user at
the subpages. This is a natural consequence of the fact
that we detect user activity only through transitions to
new subpages and have no information about the real
behavior between two such events. In particular, we are
unable to verify whether a user spent the time between two
transitions reading the subpage content or whether he left
it very quickly, started another activity (e.g. a phone
call) and then came back to browse the Internet again.



III. NETWORK STRUCTURE AND TRAFFIC 
AT THE PORTAL 




We constructed a weighted network of subpages, defining
a link weight as the number of transitions made by users
from one subpage (vertex) to another. We observed that the
number of transitions from a node i to a node j,
$m(i \to j)$, is roughly equal to the number of transitions
in the opposite direction, $m(i \to j) \approx m(j \to i)$.
We therefore simplified the network topology by introducing
undirected links with weights

$$w_{ij} = m(i \to j) + m(j \to i). \qquad (1)$$
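The construction of the undirected weights can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used for the Gemius data: the function name `build_weighted_network`, the representation of a visit chain as a list of subpage IDs, and the choice to skip repeated clicks on the same subpage are all hypothetical.

```python
from collections import Counter

def build_weighted_network(chains):
    """Return {(i, j): w_ij} with w_ij = m(i -> j) + m(j -> i).

    chains: list of visit chains, each a list of subpage IDs in click order.
    Keys are sorted pairs, so each undirected link appears once.
    """
    directed = Counter()  # m(i -> j): number of transitions from i to j
    for chain in chains:
        for src, dst in zip(chain, chain[1:]):
            if src != dst:  # assumption: a repeated click on one subpage is not a transition
                directed[(src, dst)] += 1

    weights = Counter()
    for (src, dst), count in directed.items():
        weights[tuple(sorted((src, dst)))] += count
    return dict(weights)

# Toy data (invented for illustration only)
toy_chains = [
    ["home", "news", "home", "sport"],
    ["home", "news", "news", "home"],
]
toy_weights = build_weighted_network(toy_chains)
```

Because every passage over a link is counted, a single visitor who moves back and forth contributes several units to the same weight.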

One can observe a frequent habit of going back to the
subpage that was previously visited, so one visitor can
pass many times over the same link. Therefore, the maximum
link weight can be larger than the total number of users.

FIG. 1: The weight distribution on the right, and the strength
distribution on the left (black circles - portal A; color (red)
square - portal B).

The weight distributions are power laws with the
characteristic exponent $\gamma = 1.5$ for both portals.
Defining a node strength in the usual way,

$$s_i = \sum_j w_{ij}, \qquad (2)$$

we found that the strength distribution is close to
$P(s) \sim 1/s$ in both cases. To analyze the topological
properties of the weighted network more precisely, we look
for correlations between a node strength $s_i$ and its
degree $k_i$. If a linear relation $s = \langle w \rangle k$
were observed, the correlations would be absent. In our
system we find a positive correlation: Fig. 2 presents a
power-law dependence $s \sim k^{\beta}$ with the exponent
$\beta$ larger than two, thus strengths of vertices grow
faster than their degrees.
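As an illustration of how the strength-degree relation could be examined, the sketch below computes $s_i$ and $k_i$ from a dictionary of link weights and estimates an exponent as the least-squares slope in log-log coordinates. The helper names and the plain regression (without the averaging of strength over nodes of equal degree used in Fig. 2) are assumptions made for this example, not the fitting procedure of the paper.

```python
import math
from collections import defaultdict

def strengths_and_degrees(weights):
    """weights: {(i, j): w_ij}, each undirected link listed once."""
    strength = defaultdict(float)
    degree = defaultdict(int)
    for (i, j), w in weights.items():
        for node in (i, j):
            strength[node] += w   # s_i = sum_j w_ij
            degree[node] += 1     # k_i = number of links of node i
    return dict(strength), dict(degree)

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) versus log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den

# Toy network (invented numbers)
toy = {("home", "news"): 4, ("home", "sport"): 1, ("news", "sport"): 2}
strength, degree = strengths_and_degrees(toy)

# Synthetic check: points lying exactly on s = 3 k^2 give a slope of 2
beta = loglog_slope([1, 2, 4, 8], [3, 12, 48, 192])
```

A slope equal to one would correspond to the uncorrelated case $s = \langle w \rangle k$; a larger slope signals that strength grows faster than degree.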



FIG. 2: Relation between average strength of a node and the
degree of a node (black circles - portal A; color (red) square
- portal B).

FIG. 3: The distribution of time t (in minutes) spent by the
user at one subpage.



A similar positive correlation was observed in many
real-world networks, e.g. in the world-wide airport
network [7]. This positive correlation has been explained
for a large class of networks as a result of preferential
flow allocation.

The time spent by a user at one subpage is understood
as the time between two consecutive clicks on two
different subpages performed by this user. Corresponding
time distributions were analyzed by Dezso et al. [1]
and by Gonçalves and Ramasco, who found power-law
relations with exponents $\gamma = 1.2$ and $\gamma = 1.25$,
respectively. In our case we measured $\gamma = 1.27$ for
portal A and $\gamma = 1.32$ for portal B, and the scaling
was valid over a range of two decades (see Fig. 3). The
model of separately executed tasks gives $\gamma = 1$, while
the model of (bounded) task groups leads to $\gamma > 1$.
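The definition of the time spent at one subpage translates directly into code. The sketch below is a minimal illustration: the function name `dwell_times` and the input format (a time-ordered list of `(timestamp, subpage)` pairs for one user, with timestamps in seconds) are assumptions, and repeated clicks on the same subpage are merged into one stay.

```python
def dwell_times(clicks):
    """Times between consecutive clicks on two different subpages.

    clicks: list of (timestamp_seconds, subpage_id), sorted by time, for one user.
    The time at the last subpage is unknown and therefore not reported.
    """
    if not clicks:
        return []
    times = []
    arrival, page = clicks[0]
    for t, next_page in clicks[1:]:
        if next_page != page:          # only a move to a different subpage ends a stay
            times.append(t - arrival)
            arrival, page = t, next_page
    return times

# Toy click log (invented): home at t=0, news at t=30 and t=45, home again at t=120
toy_clicks = [(0, "home"), (30, "news"), (45, "news"), (120, "home")]
toy_times = dwell_times(toy_clicks)
```

As stressed in Sec. II, such intervals measure the time between recorded events and not necessarily the time spent reading.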

The weighted network we investigated was created by
a group of users. To understand the properties of this
network we followed the details of users' paths. We
analyzed the distribution of the number of unique subpages
z visited by a user during a single visit (Fig. 4) and
observed an exponential behavior with a characteristic
parameter $\alpha$ for a given portal:

$$P(z) = A \exp(-\alpha z). \qquad (3)$$

FIG. 4: Distribution of numbers of unique subpages z visited
by a user during one visit (circles - portal A; squares -
portal B). The parameter $\alpha$ of the exponential fit is
0.31 for portal A and 0.41 for portal B.

Let us consider the relation between two variables: the
number of jumps (transitions) between subpages n and
the average number of distinct (unique) subpages
$\langle z \rangle$ corresponding to the same fixed value
of n. One can refer to z as the "interest horizon", since
it describes the user's tendency to stick to a limited
subset of subpages. The relation is presented in Fig. 5
and for n < 100 can be described by the function

$$\langle z \rangle = a \sqrt{n}. \qquad (4)$$

For both portals we observed the same square-root
dependence with exactly the same parameter a. Now, having
Eq. (4) and the distribution of the number of unique
subpages, we can find the formula for the distribution of
the number of jumps presented in Fig. 6. Since

$$P(n)\,dn = P(z)\,dz, \qquad (5)$$

we get

$$P(n) = \frac{a A \exp(-\alpha a \sqrt{n})}{2 \sqrt{n}}. \qquad (6)$$

As shown in Fig. 6, this formula fits the collected data
very well.
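The change of variables behind the jump distribution can be written out step by step. The sketch assumes that z may be replaced by its average value at fixed n, i.e. that fluctuations around the square-root law are neglected; $\alpha$ denotes the exponential parameter of P(z) and a the prefactor of the square-root relation.

```latex
% Assumption: z(n) = a\sqrt{n}, i.e. z is replaced by its average at fixed n
\frac{dz}{dn} = \frac{a}{2\sqrt{n}},
\qquad
P(n) = P\bigl(z(n)\bigr)\,\frac{dz}{dn}
     = A\,e^{-\alpha a \sqrt{n}}\;\frac{a}{2\sqrt{n}}
     = \frac{a A \exp(-\alpha a \sqrt{n})}{2\sqrt{n}} .
```

The factor $1/\sqrt{n}$ comes entirely from the Jacobian $dz/dn$, while the stretched-exponential factor is inherited from P(z).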



IV. VISITORS STRATEGY 

The analysis of statistical properties of visitor
behavior reveals an important phenomenon: frequent returns
to a previously visited subpage. One can ask how the
position of a subpage at the portal affects the frequency
of returns to it. The most special subpage is the
homepage, which in most cases is the top of the user's
decision tree. We observed that more than 40 percent of
users at portal B and 28.5 percent at portal A were
browsing using a special star strategy: the visitor
starts at the homepage, then chooses one of the other
subpages and returns to the homepage in every odd step.
At the even steps the user reads one of the other
subpages, sometimes revisiting some of them. The strategy
can be disturbed by users making slight deviations from
the star path, but they still show the same tendency
to return to a previously visited subpage within a small
number of steps.

FIG. 5: (Color online) Relation between the average number
of distinct subpages $\langle z \rangle$ and the number of
jumps n. Red points: data from portal A; black points: data
from portal B; line: fit with $\langle z \rangle = a\sqrt{n}$.

FIG. 6: Jump distribution and fit to Eq. (6).

It is clear that if a user chose a new subpage at every
even step, we would not observe the square-root relation
[Eq. (4)] but a linear one, $\langle z \rangle \sim n/2$.
This means that the star strategy alone does not explain
the scaling observed in the data. Therefore, we assume
that the homepage is not the only subpage revisited during
one visit chain. Let p* be the probability of returning to
the subpage visited one step earlier (meaning that a user
reads the same subpage at step numbers n-1 and n+1). We
restricted our calculation to the range of the first one
hundred steps, where the scaling is observed. In Tab. I we
present the total return probability p* and its two
components, which describe coming back to the homepage
($p_{home}$) and to a different subpage ($p_{different}$).
The total return probability is
$p^* = p_{home} + p_{different}$. The large value of p* can
be related to the use of the "back" button or to opening a
new window and jumping between two open browser windows.
                            portal A   portal B
probability p*                0.54       0.57
probability p_home            0.29       0.27
probability p_different       0.25       0.30

TABLE I: Comparison of return probabilities for portals
A and B.



V. SELF-ATTRACTING WALK AS A 
BROWSING SCENARIO 

The relation between the average number of distinct
subpages $\langle z \rangle$ and the number of jumps n can
be interpreted as the relation between the average number
of distinct states and the number of steps (understood as
time) in the random walk problem. This problem was
considered by many authors. In one dimension a random walk
is characterized by a square-root relation between the
number of distinct visited states and the number of steps,
$\langle z \rangle \sim \sqrt{n}$. From the famous theorem
of Pólya [14] it is known that the probability of
returning (at any time) to the starting point by a random
walker on a d-dimensional lattice can be less than one
only for d > 2. In this sense the dimension d = 2 is
critical for this dynamics, with
$\langle z \rangle \sim n/\log n$. For complex networks
with a scale-free degree distribution,
$\langle z \rangle \sim n$ (see [11]).

Various models of biased random walks have been analyzed,
see e.g. [15], [16]. A special case is the model of a
self-attracting walk [17], [18], where a previously
visited state is preferred in the next time step, although
there are different scenarios of the attracting relation.
Here we adopt such a model for portal browsing. We took
the topology of the network with weights between two nodes
defined by Eq. (1). The dynamics on the network is very
simple: each walker starts at the homepage and then, with
a transition probability $p_{ij}$, moves to one of the
neighboring subpages. The probability of transition from
vertex i to vertex j is proportional to the weight of this
edge:

$$p_{ij} = \frac{w_{ij}}{\sum_k w_{ik}} = \frac{w_{ij}}{s_i}. \qquad (7)$$

After two initial steps a tendency to return to a
previously visited page is introduced as follows. At step
n the walker returns to the node visited at step n - 2
with probability p*, and with probability 1 - p* he
chooses a random neighbor according to the transition
probability $p_{ij}$. The results of such a simulation for
both portals (Fig. 7) are very close to the real data in
the range of the first thirty steps.
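The walk rules described above can be turned into a short simulation. The following sketch is an illustration only: the function names, the nested-dictionary representation of the weighted network and the toy weights are assumptions, and the real simulation was run on the empirical portal networks.

```python
import random

def simulate_walk(adjacency, home, p_star, n_steps, rng):
    """Number of distinct nodes visited after each of n_steps steps.

    adjacency[i][j] = w_ij (symmetric). After two initial steps the walker
    returns to the node visited two steps earlier with probability p_star;
    otherwise it picks a neighbor with probability proportional to w_ij.
    """
    path = [home]
    visited = {home}
    distinct = []
    for step in range(1, n_steps + 1):
        current = path[-1]
        if step > 2 and rng.random() < p_star:
            nxt = path[-2]                       # node visited at step - 2
        else:
            neighbors = list(adjacency[current])
            link_weights = [adjacency[current][j] for j in neighbors]
            nxt = rng.choices(neighbors, weights=link_weights)[0]
        path.append(nxt)
        visited.add(nxt)
        distinct.append(len(visited))
    return distinct

def average_distinct(adjacency, home, p_star, n_steps, n_walkers, seed=0):
    """Average number of distinct nodes versus step number over many walkers."""
    rng = random.Random(seed)
    totals = [0] * n_steps
    for _ in range(n_walkers):
        for k, z in enumerate(simulate_walk(adjacency, home, p_star, n_steps, rng)):
            totals[k] += z
    return [t / n_walkers for t in totals]

# Toy weighted network (invented numbers)
toy = {
    "home": {"news": 5, "sport": 3, "weather": 1},
    "news": {"home": 5, "sport": 2},
    "sport": {"home": 3, "news": 2},
    "weather": {"home": 1},
}
mean_z = average_distinct(toy, "home", p_star=0.54, n_steps=30, n_walkers=2000)
```

Setting `p_star` to 1 freezes the walker between the two nodes reached in the initial steps, which is a convenient sanity check of the return rule.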



VI. ANALYTIC APPROXIMATION

Let us now consider a random walk on the weighted
network. In the infinite time limit the stationary
occupation probability $p_i$, describing the probability
that a walker is located at node i, is given by [19], [20]

$$p_i = \frac{s_i}{\sum_j s_j} = \frac{s_i}{N \langle s \rangle}. \qquad (8)$$

Unfortunately, in our case the walker's lifetime on the
network is not long enough to assume the stationary
distribution of the probability $p_i$, because at the
beginning of the walk the initial site is significant.
Despite this non-stationarity, we find an approximate
relation between the average number of distinct subpages
$\langle z \rangle$ and the number of jumps n. Let us
define $d_s(n)$ as the fraction of vertices of strength s
visited by a random walker at least once after n time
steps. In our weighted networks the relation between
$\langle z \rangle$ and n is analogous to that for the
unweighted networks discussed in [11]:

$$\langle z \rangle(n) = N \sum_s P(s)\, d_s(n), \qquad (9)$$

where N is the number of vertices (subpages of the
portal). Changes of $d_s(n)$ are given by

$$\frac{d\, d_s(n)}{dn} = [1 - d_s(n)]\, p(s). \qquad (10)$$

Here p(s) is the probability that a walker observed at a
random moment is at a site of strength s. The above
equation is true when p(s) is stationary and it is only
an approximation during the first steps of the walker's
path. Now we extend this equation by taking into account
the star strategy discussed in the previous section. The
walker returns to the page visited two steps earlier with
probability p*, and in such cases $d\, d_s/dn = 0$. This
leads to the equation

$$\frac{d\, d_s^*(n)}{dn} = [1 - d_s^*(n)]\, p(s)\,(1 - p^*). \qquad (11)$$

Taking into account the initial condition $d_s^*(0) = 0$
we get the solution

$$d_s^*(n) = 1 - \exp[-n\, p(s)\,(1 - p^*)] \qquad (12)$$

with the characteristic relaxation time

$$\tau(s) = \frac{1}{p(s)\,(1 - p^*)}. \qquad (13)$$

One can see that the relaxation process is slowed down
by the factor 1 - p*. From Eqs. (9) and (12), using the
stationary probability (8), we have:

$$\langle z \rangle(n) = N - N \sum_s P(s) \exp\left[-\frac{s\, n\, (1 - p^*)}{\langle s \rangle N}\right]. \qquad (14)$$

Since for infinite networks the first moment of the
empirically observed distribution $P(s) \sim 1/s$
diverges, we used the real data to estimate
$\langle s \rangle$. The resulting solution (14) is
presented in Fig. 7 and it fits well to the numerical
simulations discussed in Sec. V.
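The closed-form result for $\langle z \rangle(n)$ is easy to evaluate numerically. In the sketch below the sum $N \sum_s P(s)\,(\ldots)$ is replaced by a sum over the empirical list of node strengths, which is the same quantity when P(s) is the observed strength histogram; the function name and the toy strengths are assumptions of this example.

```python
import math

def mean_distinct_pages(strengths, n, p_star):
    """<z>(n) = N - sum_i exp[-s_i n (1 - p*) / (<s> N)] over all nodes i."""
    N = len(strengths)
    mean_s = sum(strengths) / N          # <s> estimated directly from the data
    return N - sum(
        math.exp(-s * n * (1.0 - p_star) / (mean_s * N)) for s in strengths
    )

# Toy strengths (invented numbers), evaluated with the p* value of portal A
toy_strengths = [50, 20, 10, 10, 5, 5]
curve = [mean_distinct_pages(toy_strengths, n, p_star=0.54) for n in (0, 10, 100)]
```

The curve starts at zero, grows monotonically and saturates at N; a larger p* stretches the relaxation by the factor 1/(1 - p*).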




FIG. 7: (Color online) The relation between the average number of distinct subpages and the number of jumps. Portal A on the
right and portal B on the left; black points are data, lines are Eq. (14), red triangles come from the numerical simulation.
Results of the numerical simulation (1 million artificial users and n = 1000 steps for each) fit perfectly to the real portal
data for a number of steps n < 30. After system thermalization (more than 30 steps) we observed agreement between the analytic
calculation and the simulation results.



VII. CONCLUSIONS 

We show that a user's one-day interest horizon, measured
as the number of distinct visited subpages, is relatively
small in comparison to the number of all transitions
at the portal, i.e. to the number of all subpages visited
by the user. This means that people return many times to
the same subpage, or pass through the same page, during a
one-day visit session. There can be various explanations
of this phenomenon. The large probability of coming back
to the homepage can be a result of the news portal
structure, with the homepage being a network hub. However,
since the probability of returning to any other subpage is
also significant, it may suggest that it is somehow
difficult for users to find the information they need. If
they consider a visited subpage inadequate to what they
were expecting, they come back one step up the portal
structure and try another subpage. That could be a hint
for web designers regarding portal functionality. Another
conclusion is that the information seeking of portal users
is in fact restricted to a narrow range of what they find
most interesting. In this sense, the Internet seems to be
a perfect tool for keeping an eye on a changing situation
in areas of some importance to the users (for example
stock market indexes, political news, topical portals).
However, the existence of such a global and easily
accessible 'knowledge mine' does not necessarily enlarge
people's general interest horizon. Another possibility
results from the fact that portal visitors frequently need
to pass through a few "transit" pages to get from one aim
to another. Such a scenario would correspond well to the
observed value $\gamma > 1$ (see Fig. 3) for the
distribution of times between consecutive clicks and to
the model of bounded task groups.

Our simple model of a self-attracting walk shows that
the real data is very well reproduced by a short-memory
process. The observed scaling relation between the average
number of distinct subpages $\langle z \rangle$ and the
number of jumps n is well reconstructed using the node
strength as a measure of popularity and the rule of coming
back to the previous page with probability p*. The
solution of the rate equation fits the simulation results
well for a number of clicks n > 30. However, it is
difficult to compare the developed model directly with the
collected portal data, because the number of users
visiting more than 30 subpages is not large. This
agreement allows us to understand the behavior of the
users as a random walk with short memory.